Machine learning is the study of computer algorithms that aim to generate models, called learners, able to make predictions or decisions without being explicitly programmed to do so. At the base of machine learning lies a whole statistical learning framework that explains the structure of the learning paradigm in terms of a data-generation model and a measure of success (or loss), and defines the learning activity as the maximization of this success (or, equivalently, the minimization of the loss).
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. These bounds are given in statistical terms and involve two sources of uncertainty: first, an approximation term that quantifies how good the learner is, and second, a probabilistic term that gives the probability that the learner will actually be a good learner. Based on the constraints and the guarantees that we require, we can define different types of learnability:
The theory tells us that it is not crazy to think of an algorithm able to learn. So the question one is naturally inclined to ask is: if we can construct algorithms able to learn, why do we need data scientists?
Unfortunately (or fortunately, depending on the point of view) the same exact theory tells us that we cannot imagine any behavior remotely close to human intelligence without some human intelligence backing it up. The theory tells us that a hypothesis class is PAC learnable if for any problem we can find a number of examples to feed to the learner that guarantees it will be able (probably and approximately) to learn the problem. The key point is that to have these guarantees of convergence of the learning method, we need to introduce some prior knowledge into the framework, biasing it. For this reason, this prior knowledge is referred to as inductive bias and it is typically encoded in terms of constraints on the hypothesis set (or type of learner) we decide to adopt to tackle a given problem.
Is it really necessary to have prior knowledge on the problem?
We can ask ourselves whether there exist an algorithm $\mathcal A$ and a training set size $m$ such that, for every distribution $\mathcal D$, if $\mathcal A$ receives $m$ i.i.d. examples from $\mathcal D$, there is a high chance it outputs a predictor $h$ that has a low risk. An algorithm of this type is said to be a universal learner.
The "no free lunch" principle was formalized by David Wolpert and William Macready in 1997. The No-Free-Lunch theorem states that no such universal learner exists. To be more precise, the theorem states that for binary classification tasks, for every learner there exists a distribution on which it fails, even though that distribution can be successfully learned by another learner.
This means that it is impossible to find a learner able to successfully tackle every task, and it is a human problem to decide which algorithm is the most suited for a given problem.
That's not all! In the feature space each variable is a dimension. Real data can have thousands of dimensions: consider for example images, or text documents where the dimensionality is given by the vocabulary of words. These high-dimensional spaces involve a lot of problems intrinsically related to their mathematical properties.
High-dimensional spaces are almost completely empty and the points are isolated
The curse of dimensionality is an expression coined by Richard Bellman to refer to phenomena that arise when analyzing and organizing data in high-dimensional spaces. In the feature space $\mathcal X$ each variable represents a dimension. As variables get added, the data space becomes very sparse. In practice this means that high-dimensional spaces are almost completely empty.
There are different approaches to show this behavior. Assume your data lives in the space $[0, 1]^p$, that is, the unit hypercube in $p$ dimensions. To capture a neighborhood which represents a fraction $s$ of the hypercube volume, you need the edge length to be $s^{1/p}$, such that $(s^{1/p})^p=s$:
As we can see from the graph, this value rapidly approaches 1 for growing values of $p$. For example, to capture 10% of the volume of a 10D hypercube, you need another 10D hypercube of edge $0.1^{1/10} \simeq 0.8$. In other words, a 10D hypercube of edge 0.8 contains only 10% of the information of the entire unit 10D hypercube. This means that the 10D hypercube is practically empty.
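The required edge length $s^{1/p}$ is easy to compute directly. A minimal sketch (the fraction $s=0.1$ and the chosen dimensions are arbitrary illustrative choices):

```python
# edge length of the sub-hypercube that captures a fraction s of the unit
# hypercube volume in p dimensions: s^(1/p)
def edge_length(s, p):
    return s ** (1.0 / p)

for p in [1, 2, 10, 100]:
    print(f"p = {p:3d} -> edge = {edge_length(0.1, p):.3f}")
```

For $p=100$ the required edge is already $\simeq 0.977$: essentially the whole cube.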
Note that the converse also holds. Considering a $p$-hypercube with an edge length of $r=0.1$, its volume is $0.1^p$. When $p$ grows, the volume quickly becomes so small that the probability of capturing points in your neighborhood becomes very small.
Moreover, consider two independent variables $X,Y$, uniformly distributed on $[0,1]^p$. It can be proved that the squared distance $\|X-Y\|^2$ satisfies:
$$ \mathbb E \big[\|X-Y\|^2\big] = p/6,\qquad \mathrm{std}\big[\|X-Y\|^2\big] \simeq 0.2\sqrt p $$We can analyze this behavior for different values of $p$. From a log-log plot we can clearly see a linear trend for both quantities, but with different slopes: the expected squared distance grows linearly in $p$, while its standard deviation grows only like $\sqrt p$. This means that for growing $p$ the average distance will increase very fast, while the differences between distances become negligible.
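The two formulas can be checked empirically with a quick Monte Carlo simulation (sample size and dimensions here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 20000

for p in [10, 100, 1000]:
    X = rng.random((n_pairs, p))
    Y = rng.random((n_pairs, p))
    sq_dist = np.sum((X - Y) ** 2, axis=1)   # squared distances of random pairs
    print(f"p={p:5d}  mean={sq_dist.mean():8.2f}  (p/6 = {p/6:7.2f})  "
          f"std={sq_dist.std():6.2f}  (0.2*sqrt(p) = {0.2 * np.sqrt(p):5.2f})")
```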
In high dimensions the notion of distance loses its meaning
In practice, what happens is that in higher-dimensional spaces the concept of distance loses its meaning. To show this behavior, we can draw 100 points uniformly from a $p$-dimensional unit hypercube and plot the normalized distribution of the pairwise distances.
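A small sketch of this experiment, summarizing each distribution with the relative contrast between the farthest and nearest pair (a common way to quantify distance concentration; sample size and dimensions are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
contrast = {}

for p in [2, 10, 100, 1000]:
    pts = rng.random((100, p))               # 100 uniform points in the unit p-cube
    d = np.array([np.linalg.norm(a - b) for a, b in combinations(pts, 2)])
    # relative contrast: how much farther the farthest pair is than the nearest one
    contrast[p] = (d.max() - d.min()) / d.min()
    print(f"p={p:5d}  (max-min)/min = {contrast[p]:8.2f}")
```

The contrast collapses as $p$ grows: in high dimensions all pairs of points sit at roughly the same distance from each other.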
The statistical framework of machine learning gives us guarantees on the learning activity, but it also shows us that it is fundamental to introduce some human intelligence into this framework in order to make it work.
It is exactly here that the data scientist comes in, injecting the real intelligent part into the system. Their job is to prepare the framework so that it works correctly.
They start from the raw data and analyze it, clean it, shape it and craft it. This preparation is done through a whole data-preparation pipeline, of which only the last step is the choice of the algorithm that best suits our problem.
In this sense the data scientist is a sort of data craftsman who, like every other artisan, has their own (mathematical) toolbox to master.
Principal Component Analysis is probably the most famous and one of the oldest multivariate statistical techniques. It was originally proposed by Karl Pearson in 1901, but its current embodiment is due to the work of Hotelling, who in 1933 formalized the idea and first introduced the term principal component.
The idea at the base of PCA is to find an orthogonal set of vectors that spans the original space where the data are located. These vectors are not chosen randomly but are the ones that carry the most information possible about the original set. The information present in the data is measured in terms of variance. Hence, the vectors of the orthogonal basis are the directions of maximum variance of the data. These vectors are called principal components and are constructed as linear combinations of the original features. Moreover, they are ordered by importance, such that the first component has the largest possible variance (the most information possible). The second component is the second most informative and is orthogonal to the first one, and so on.
PCA aims to find the most accurate data representation in a lower dimensional space
The idea of PCA is to find the most accurate representation of the initial data in a lower-dimensional space. So, for example, if we have a set of 2D points and we map it onto a 1D line, namely the first principal component, we want this projection to be the best possible 1D representation of our original points. From this perspective we can define the principal components as the ordered set of vectors that minimize the orthogonal projection error.
Consider a dataset $X\in\mathbb R^{m, n}$, where $m$ is the number of samples (or observations), each one with $n$ features (or parameters, or variables). Typically a preprocessing step is applied to prepare the data before applying the PCA algorithm. The most common type of preprocessing is to subtract the mean of each variable from the dataset to center the data around the origin. In practice we apply a transformation of the type:
$$ x_i \quad\leftarrow \quad x_i - \mu, \qquad\forall \;x_i \in X^T $$where $\mu \in\mathbb R^n$ is the vector containing the empirical mean of each column of $X$ and $x_i$ is a sample (row) in the dataset. The final matrix will have column-wise zero empirical mean. A matrix of this type is said to be in mean-deviation form.
The change of coordinate system is done as follows. Consider the set of principal components $\mathcal V = \{v_1, \ldots, v_n\}$ and the corresponding orthonormal matrix $V = [v_1, \ldots, v_n]\in\mathbb R^{n,n}$ whose columns are the principal components. The matrix $V$ is an orthonormal basis of the space $\mathbb R^n$, where the original data points (the observations) are located. For each row $x_i\in \mathbb R^n$ of the matrix $X$ we can retrieve the projection along the component $k$ as:
$$ f_{i,k} = x_i^T v_k $$
We can do this for each component to obtain a new vector of principal component scores:
$$f_i = x_i^TV$$In this context the matrix $V$ is typically referred to as the loadings matrix. Note also that the vector $f_i$ represents the original sample $x_i$ in the new coordinate system of the principal components. We can project all the points, obtaining the matrix of factor scores:
$$ F = XV $$In the principal component coordinate system, the variance along the direction $k$ can be expressed simply as $\sum_{i}f_{i,k}^2$, since the points are mean-centered. As stated above, the goal is to find the directions of maximum variance. Considering the first principal component, we seek the direction $v_1$ that maximizes the variance. We can write this problem mathematically as:
$$ v_1 = \displaystyle\arg\max_{\|v\|=1}\bigg\{\|Xv\|^2\bigg\} =\arg\max_{\|v\|=1}\bigg\{v^TX^TXv\bigg\} $$Since $v$ is constrained to be a unit vector, we can equivalently rewrite the problem in the form:
$$ v_1 = \displaystyle\arg\max\bigg\{\frac{v^TX^TXv}{v^Tv}\bigg\} $$The function $\mathcal R(A,v) = \frac{v^TAv}{v^Tv}$, mapping $\mathbb R^n \to \mathbb R$, is called the Rayleigh quotient. The stationary points of this function are the eigenvectors of the matrix $A$, and its values at those points are the corresponding eigenvalues. The problem of finding the principal components thus reduces to an eigendecomposition of the matrix $X^TX$.
A set of orthogonal vectors that maximize the Rayleigh quotient of $X^TX$ is the set of eigenvectors of $X^TX$
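This equivalence is easy to verify numerically: the top eigenvector of $X^TX$ attains a variance $\|Xv\|^2$ that no other unit vector can exceed. A small sketch (random data, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X -= X.mean(axis=0)                         # put X in mean-deviation form

# eigendecomposition of X^T X (symmetric, so eigh; eigenvalues in increasing order)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v1 = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue

# the variance along v1 equals the largest eigenvalue ...
print(np.isclose(np.linalg.norm(X @ v1) ** 2, eigvals[-1]))   # True

# ... and no random unit vector achieves a larger one
V = rng.standard_normal((1000, 5))
V /= np.linalg.norm(V, axis=1, keepdims=True)
best_random = max(np.linalg.norm(X @ v) ** 2 for v in V)
print(np.linalg.norm(X @ v1) ** 2 >= best_random)             # True
```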
The PCA algorithm can then be described in two simple steps:
First of all, we need to preprocess our dataset $X\in\mathbb R^{m, n}$. Almost always, the columns of $X$ are zero-centered, so that the mean of each column is 0. A matrix in this form is said to be in mean-deviation form. Typically the matrix $X^TX$ is also divided by $m-1$; in this case the model is referred to as covariance PCA, since $X^TX/(m-1)$ is then the covariance matrix of $X$. In other cases, in addition to the centering, we can apply a standardization of each variable. In this case we typically refer to correlation PCA, since the resulting matrix is the correlation matrix.
For the second step, performing an eigenvalue decomposition directly is not a trivial task. Fortunately, an amazing mathematical tool comes to our aid: the Singular Value Decomposition. An SVD of a matrix $X$ is a factorization of the type $X = U\Sigma V^T$ where $U\in\mathbb R^{m,m}$ is an orthogonal matrix, $\Sigma\in\mathbb R^{m,n}$ is a diagonal matrix containing the singular values of $X$ in decreasing order, and $V\in\mathbb R^{n,n}$ is an orthogonal matrix.
The usefulness of the SVD derives from the fact that there exists a simple relation between the SVD of the centered matrix $X$ and the eigenvalue decompositions of the matrices $X^TX$ and $XX^T$: since $X^TX = V\Sigma^T\Sigma V^T$ and $XX^T = U\Sigma\Sigma^T U^T$, the columns of $V$ are the eigenvectors of $X^TX$, the columns of $U$ are the eigenvectors of $XX^T$, and the squared singular values $\sigma_i^2$ are the corresponding (nonzero) eigenvalues of both.
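The relation between the SVD and the eigendecomposition of $X^TX$ can be verified numerically on random data (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
X -= X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # eigenvalues in increasing order

# squared singular values are the eigenvalues of X^T X
print(np.allclose(S ** 2, eigvals[::-1]))                              # True
# right singular vectors are eigenvectors of X^T X (up to sign)
print(np.allclose(np.abs(Vt), np.abs(eigvecs[:, ::-1].T), atol=1e-6))  # True
```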
import numpy as np
import numpy.linalg as la

class myPCA():
    def __init__(self, n_components=None, use_std=False, solver='svd'):
        '''
        n_components: int or float in [0,1]
            - if int, it is interpreted as the number of components to keep
            - if float, it is interpreted as the fraction of variance to keep
        use_std: bool, default=False
            - if False, X will only be centered column-wise (covariance PCA)
            - if True, X will be standardized, dividing also by the std of each variable (correlation PCA)
        '''
        self.use_std = use_std
        self.n_components = n_components
        self.var_explained = None
        self.solver = solver

    def __scale_matrix(self, X):
        # Center the matrix
        self.mean = X.mean(axis=0)
        X_centered = X - self.mean
        # Scale the matrix
        if self.use_std:
            self.std = X.std(axis=0)
            X_centered /= self.std
        return X_centered

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

    def fit(self, X):
        '''
        Computes the principal components of the dataset X
        X: np.array of shape (m, n)
            - m is the number of samples (observations) in the dataset
            - n is the number of variables (features)
        '''
        self.n_samples, self.n_features = X.shape
        X_centered = self.__scale_matrix(X)
        if self.solver == 'svd':
            U, S, Vt = la.svd(X_centered, full_matrices=False)  # economy-size SVD of X
            self.U_, self.S_, self.V_ = U, S, Vt
            V = Vt.T  # eigenvectors of X^T X
            self.lambdas = (S ** 2) / (self.n_samples - 1)  # eigenvalues of the covariance matrix
        elif self.solver == 'eig':
            cov = (X_centered.T @ X_centered) / (self.n_samples - 1)
            self.lambdas, V = la.eigh(cov)  # eigh: the covariance matrix is symmetric
        # Order eigenvalues and eigenvectors by decreasing eigenvalue
        i_lambdas = np.argsort(self.lambdas)[::-1]
        self.lambdas = self.lambdas[i_lambdas]
        V = V[:, i_lambdas]
        self.components_ = V
        self.__find_components()

    def __find_components(self):
        '''
        Private method used to compute the number of components to keep
        '''
        # variance explained by each component and cumulative variance explained
        tot_variance = np.sum(self.lambdas)
        self.relative_variance_explained = self.lambdas / tot_variance
        self.cumulative_variance_explained = np.cumsum(self.relative_variance_explained)
        if self.n_components is None:
            self.n_components = len(self.lambdas)
        elif type(self.n_components) is int:
            pass  # keep the requested number of components
        elif 0.0 <= self.n_components <= 1.0:
            # if n_components is a float in [0,1], take the first index whose
            # cumulative variance explained is greater than or equal to it
            self.n_components = np.argmax(self.cumulative_variance_explained >= self.n_components) + 1
        else:
            # in all the other cases default to the maximum possible
            self.n_components = self.n_features
        self.var_explained = self.cumulative_variance_explained[self.n_components - 1]

    def transform(self, X, n_components=None):
        '''
        Projection of X onto the principal component space
        '''
        if n_components is None or n_components > self.n_components:
            n_components = self.n_components
        B = X - self.mean
        if self.use_std:
            B = B / self.std
        Vm = self.components_[:, :n_components]
        return B @ Vm

    def inv_transform(self, W, n_components=None):
        '''
        Reconstruction of the matrix from the principal component space
        '''
        if n_components is None or n_components >= self.n_components:
            n_components = min(W.shape[1], self.n_components)
        Vm = self.components_[:, :n_components]
        B = W[:, :n_components] @ Vm.T
        if self.use_std:
            B = B * self.std
        B += self.mean
        return B
The main characteristic of PCA is that it gives us a very simple way to quantify the relevance of each direction. Indeed, for each direction of the new coordinate system we know how much information is related to it. The fraction of variance explained by each component $v_i$ can simply be computed as the eigenvalue $\lambda_i$ corresponding to that component over the trace of the covariance matrix:
$$ \text{fraction\_var\_explained}_i = \frac{\lambda_i}{\sum_{j}\lambda_j}$$This gives us a precise way to perform a dimensionality reduction, knowing exactly a priori how much information we are going to lose, and lets us trade off the dimensionality of the feature space against the information we want to retain. Sorting the components by variance explained, we may find that some of them account for only a small fraction of the total variance. This means that these directions are not very informative for the dataset, and we can choose to drop them, keeping only the first, more informative ones.
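A minimal sketch of this selection rule (the synthetic data and the 95% threshold are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data whose variance decays across the columns, so the spectrum decays too
X = rng.standard_normal((500, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

lambdas = np.linalg.eigvalsh(X.T @ X / (len(X) - 1))[::-1]  # sorted decreasingly
frac = lambdas / lambdas.sum()               # fraction of variance per component
cum = np.cumsum(frac)
k = np.argmax(cum >= 0.95) + 1               # smallest k explaining >= 95%
print(f"keep {k} of {len(lambdas)} components "
      f"({cum[k - 1]:.1%} of the variance explained)")
```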
A classical application of PCA dimensionality reduction is images. Typically, an image is a tensor of pixels of dimension $n\times m\times 3$, where the 3 stands for the RGB channels. We can convert the image to grayscale, obtaining a matrix of pixels of dimension $n\times m$, and then apply PCA on it.
The image above is a classical test image of dimension 512 x 512 pixels. To show in practice the power of PCA, we can try to apply it on this image.
In the picture above we can clearly see an example of the aforementioned behavior. The first component alone accounts for around 25% of the total variance, and the first 5 components account for over 66%. Moreover, the first 20 components account for 90% of the information present in the image. This means that the summed contribution of the remaining $492$ components accounts for less than 10% of the total information.
To show clearly what this means in practice, we can project the image onto lower-dimensional spaces, considering an increasing number of components, and then reconstruct an approximation of the original image. As we can see, considering 100 principal components (99% of variance explained) the reconstructed image is practically identical to the original one. However, instead of dealing with a matrix of dimension $512\times512$, we now have to manage a matrix of dimension $512\times100$.
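The reconstruction step can be sketched with a truncated SVD; here a synthetic $512\times512$ matrix stands in for the grayscale image (the rank-20 signal and the noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for a 512x512 grayscale image: a rank-20 signal plus noise
A = rng.standard_normal((512, 20)) @ rng.standard_normal((20, 512))
A += 0.1 * rng.standard_normal((512, 512))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

def reconstruct(k):
    '''rank-k approximation from the first k singular triplets'''
    return U[:, :k] * S[:k] @ Vt[:k]

for k in [5, 20, 100]:
    rel_err = np.linalg.norm(A - reconstruct(k)) / np.linalg.norm(A)
    print(f"k = {k:3d}  relative reconstruction error = {rel_err:.3f}")
```

The error drops quickly with $k$, exactly as in the image example: a few dozen components already capture nearly all of the content.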
Principal Component Analysis is a very simple and effective tool, but very often it is applied in a black-box fashion to perform dimensionality reduction. However, PCA can also be used as an exploratory tool. Indeed, it can give us some insight into the structure and distribution of our data. For this reason, having a strategy to analyze the results of our PCA model is fundamental.
For the next part I will use a dataset containing all the players present in EA Fifa 19 (https://www.kaggle.com/karangadiya/fifa19). For each player, 88 attributes are recorded. Some of them are generic attributes such as "Name", "Nationality", "Photo" (the url of the thumbnail) or "Release Clause".
The attributes interesting for this part are the ones from column 53 to 87. These 34 attributes are related to different skills covering different technical areas, like 'Finishing' or 'LongShoot' for strikers, 'Passing' and 'Vision' for midfielders, 'Tackling' and 'Interception' for defenders, and 'GKDiving' and 'GKReflexes' for goalkeepers. Each skill is given a score in the range $[0,100]$ for each player. The whole set of scores contributes to the overall score of the player.
The set of skills interesting for our analysis are the following
In Fifa 19, a lot of different Positions are considered for each area of the field.
To simplify the analysis, each player is mapped to the area of the field closest to his position. In this way, only 4 categories are considered as GeneralPosition: Striker, Midfielder, Defender and Goalkeeper.
def map_position_to_general(position):
    if position in ['ST', 'LF', 'CF', 'RF', 'LS', 'RS', 'LW', 'RW']:
        return 'ST'
    elif position in ['CAM', 'LM', 'CM', 'RM', 'CDM', 'LAM', 'RAM',
                      'LCM', 'LDM', 'RCM', 'RDM']:
        return 'MF'
    elif position in ['LWB', 'LB', 'CB', 'RB', 'RWB', 'LCB', 'RCB']:
        return 'DF'
    else:
        return 'GK'

df['GeneralPosition'] = df.Position.apply(map_position_to_general)
player_color_dict = {'GK': 0, 'DF': 1, 'MF': 2, 'ST': 3}
color_player_dict = {DEFAULT_PLOTLY_COLORS[v]: k for k, v in player_color_dict.items()}
df['Color'] = [DEFAULT_PLOTLY_COLORS[player_color_dict[k]] for k in df.GeneralPosition]
df[['Name', 'Position', 'GeneralPosition']].head()
Another way to look at the formulation of this problem is to consider each sample $x_i \in X^T$, $i=1,\ldots,m$, as a physical point, and thus the whole dataset as a physical body. We can define the inertia of a column as the sum of the squared elements of the column:
$$ I_j^2 = \sum_{i=1}^m x_{i,j}^2 $$The sum of all the column inertias is called total inertia, or simply the inertia of the data table:
$$ I = \sum_{j=1}^n I_j^2 = \sum_{j=1}^n\sum_{i=1}^m x_{i,j}^2 $$The center of gravity $g$ is the vector of the means of each column of $X$, and the distance of each point from the barycenter is given by:
$$ d_{i,g}^2 =\sum_{j=1}^n (x_{i,j} - g_j)^2 $$Consider putting the dataset in mean-deviation form, or, in this picture, shifting the body so that the barycenter is positioned at the origin. We now have $g = 0 \in\mathbb R^n$ and the distance of each point is:
$$ d_{i,g}^2 =\sum_{j=1}^n x_{i,j}^2 = I_i $$The principal components are the axes of inertia of the body. The importance of each component is given by the proportion of inertia it "explains".
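The trace identity behind this view — the total inertia of the centered table equals the sum of the eigenvalues of $X^TX$ — can be checked numerically (random data, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)            # move the barycenter g to the origin

inertia = np.sum(X ** 2)       # total inertia: sum of all squared entries
eigvals = np.linalg.eigvalsh(X.T @ X)

# the inertia is invariant under the change of basis: it equals trace(X^T X),
# i.e. the sum of the eigenvalues
print(np.isclose(inertia, eigvals.sum()))   # True
```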
In the picture above the pair plots of the first three features are shown. Pair plots are a very simple way to investigate how variables interact in pairs. The plots are organized in a grid of dimension $\text{num\_features}\times \text{num\_features}$. The off-diagonal plots $(i,j)$ (with $i\neq j$) depict the scatter plots of the variable pair $i$ and $j$, where we can investigate the interaction between the two variables. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.
The black dotted line in the picture is the first principal component, projected in each subplot along the two corresponding variables. As we can clearly see, if we consider the swarm of points as a physical body, the first principal component is the axis that captures the maximum possible inertia of the body.
The matrix $V$ is called the loadings matrix. As seen before, each principal component is constructed as a linear combination of the original variables. The matrix $V$ contains the coefficients of these linear combinations. In particular, the value $v_{i,k}$ is the contribution of variable $i$ to component $k$.
For each component we can plot the loadings as a bar plot to look at the contribution of each of the original variables. Opposing the positively and negatively contributing features can help us give an interpretation to the principal component.
The vertical bar plots give us some important insight into the behavior of the first two principal components. From the first plot we can see a clear positive contribution of the goalkeeper's attributes and a negative contribution of all the other attributes. The second plot, at first glance, seems less clear. However, highlighting the most important attributes in absolute value, we can see a positive contribution of defensive attributes such as 'Tackle', 'Marking' and 'Interception', and a negative contribution of attributes like 'Finishing', 'Longshots' and 'Volleys'. From the two plots above we can clearly interpret the first component as a measure of "goalkeepingness" and the second as an axis opposing defensive and offensive skills.
The correlation between a component and a variable estimates the information they share. Since for each variable the sum of its squared correlations over all components is equal to 1, a good way to plot them in a 2D space is to consider the unit circle of correlations.
When the data are perfectly represented by only two components, the sum of the two squared loadings is equal to one, and so the variable will be positioned on the circle. When two components are not enough to reconstruct a variable, the variable will be positioned inside the circle.
The closer a variable is to the circle of correlations, the better we can reconstruct it from the first two components
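The property behind the circle can be checked numerically: for each variable the squared correlations with all the components sum to 1, so its coordinates on the first two components always fall inside (or on) the unit circle. A sketch on random correlated data (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))  # correlated variables
X -= X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
F = X @ Vt.T                                   # factor scores

# correlation of each variable j with each component k
n = X.shape[1]
corr = np.array([[np.corrcoef(X[:, j], F[:, k])[0, 1] for k in range(n)]
                 for j in range(n)])

# squared correlations of a variable sum to 1 over all the components ...
print(np.allclose((corr ** 2).sum(axis=1), 1.0))                 # True
# ... so in the plane of the first two components every variable is in the circle
print((np.linalg.norm(corr[:, :2], axis=1) <= 1 + 1e-9).all())   # True
```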
Consider the SVD decomposition $X = U\Sigma V^T$. We have that the matrix of factor scores is:
$$ F = XV = U\Sigma V^T V = U\Sigma$$and so the sum of the squared factor scores for each component is equal to the corresponding eigenvalue:
$$ F^TF = \Sigma^TU^TU\Sigma = \Sigma^T\Sigma \underset{\Sigma \text{ diagonal}}{=} \Sigma^2 $$Having this in mind, we can quantify the contribution of each observation to a component as the partial variance (inertia) over the total variance (inertia) of the component:
$$ ctr_{i,k} =\frac{f_{i,k}^2}{\sum_{j=1}^mf_{j,k}^2} = \frac{f_{i,k}^2}{\lambda_k} $$Obviously the contribution of each observation is in the range $[0,1]$, while the sum of all the contributions for a component is $\sum_{j} ctr_{j,k} = 1$.
A typical heuristic approach to interpreting the principal components is to consider the observations whose contribution is greater than the average contribution. In this case, the observations that differ in sign can be opposed as the endpoints of the component.
def contribution(X, pca, k):
    # squared factor scores along component k, normalized by the component's inertia
    f = pca.transform(X)[:, k] ** 2
    return f / sum(f)
The contribution plot reinforces our beliefs. We can see how the players that contribute the most to the first component are goalkeepers, while in the second component there is a heterogeneous contribution of 'ST', 'MF' and 'DF'.
Highlighted in the plot there is a point whose contribution is very high in the second component, but also quite high in the first. Indeed, this player has the highest contribution of the entire dataset.
Digging into the dataset we can retrieve the information relative to this player, and we can see that this is not just 'a player': it is Lionel Messi, the highest overall-rated player in Fifa 19. Aside from his overall rating, his contribution to the goalkeeper component is probably also due to the fact that he has scores for 'GKKicking', 'GKPositioning' and 'GKHandling' higher than the average of non-GK players.
Finally, we can plot the simple scores of each observation along the first two components. These are in practice the coefficients of each observation in the PCA basis. From this plot too we can clearly see how the first component is an attribute of 'Goalkeepingness' while the second component is an attribute of 'Defensiveness'.
In many cases, we are only interested in extracting the important information from the dataset, reducing its dimensionality as much as possible. In this case we only want to know how many components to keep. This question has no definitive answer, but there are some useful guidelines to help us in our decision:
When the original dataset is considered to be the population of interest (we have all the data at the beginning and we work only on it, as in the previous examples), PCA corresponds to a Fixed Effect Model. In this case, in fact, the model parameters are fixed, non-random quantities. PCA is then descriptive, and the amount of variance of $X$ explained by a component indicates exactly its importance. In this setting, we can quantify the quality of the PCA restricted to the first $k$ components as the reconstruction error. Consider the matrix $\hat X^{(k)}$, the approximation of $X$ obtained considering only the first $k$ components:
$$ \begin{aligned}\hat X^{(k)} &= U^{(k)}\Sigma^{(k)}V^{(k)T}\\ &= F^{(k)}V^{(k)T}\\ &= XV^{(k)}V^{(k)T} \end{aligned} $$In practice we are projecting our original dataset $X$ onto the first $k$ principal components $V^{(k)}$, and then reconstructing an approximation of the original matrix with the inverse operation $V^{(k)T}$ (the inverse, since $V$ is orthonormal). This approximation carries an error, due to the fact that we are considering only the first $k$ components:
$$ X = \hat X^{(k)} + E $$where $E$ is a matrix accounting for the error $X - \hat X^{(k)}$. We can quantify this error in different ways, but the most widely adopted approach is to consider the Residual Error Sum of Squares:
$$ \begin{aligned}RESS_k &= \|X - \hat X^{(k)}\|_F^2\\ &= \mathcal I - \sum_{i=1}^k \lambda_i\\ &= \sum_{i=k+1}^n \lambda_i \end{aligned} $$where $\|\cdot\|_F$ is the Frobenius norm. This error is thus also quantifiable as the sum of the eigenvalues that we are neglecting. The lower the RESS, the better the model. Obviously, increasing the number of components we will have a better estimation, and considering all $k=n$ components we will be able to reconstruct the original matrix $X$ perfectly.
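The identity $RESS_k = \sum_{i=k+1}^n \lambda_i$ is easy to verify numerically (here the $\lambda_i$ are the eigenvalues of $X^TX$, i.e. the squared singular values; data and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
lambdas = S ** 2                        # eigenvalues of X^T X, decreasing order

for k in [1, 3, 5]:
    X_hat = U[:, :k] * S[:k] @ Vt[:k]   # rank-k reconstruction X V^(k) V^(k)T
    ress = np.linalg.norm(X - X_hat, 'fro') ** 2
    print(f"k={k}:  RESS = {ress:9.3f}   discarded eigenvalues = {lambdas[k:].sum():9.3f}")
```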
Typically, the dataset that we have is only a subset of observations of the real population. Our goal is to estimate the value of new observations from this population. This framework corresponds to a Random Effect Model. In this case we are in practice only approximating the real parameters of the PCA, and due to the randomness we have to adopt different strategies to evaluate our PCA model. Usually these are based on computer-based resampling techniques such as bootstrap and cross-validation.
A popular cross-validation technique adopted in this context is the jackknife (or Leave One Out). In this procedure, each observation is selected in turn to form the test set, while the remaining observations constitute the training set. We then fit the PCA model on the training set and compute an estimate of the left-out sample considering the first $k$ components, according to a random effect model. We can store all these estimates in a matrix $\tilde X$ and finally compute the estimation error w.r.t. the original matrix $X$. Note that in this case we typically talk of the Predicted Residual Error Sum of Squares:
$$ PRESS_k = \|X - \tilde X_k\|_F^2 $$As in the previous case, the lower the PRESS, the better the model. Note also that, in contrast with the fixed effect model, even considering all the components we have no guarantee of reconstructing $X$ perfectly.
This problem shows up particularly when the number of variables is larger than the number of observations, a situation referred to as 'small N large P' in the literature.
LOO is a really computationally expensive method: considering $m$ samples, we have to fit $m$ PCA models on $m-1$ samples each, only to test on a single sample. Obviously, for very large $m$ this becomes unfeasible. In these situations we can adopt other techniques. k-fold Cross Validation consists in splitting the training set into k equal subsets and then applying the following iterative procedure:
loop i = 1, ..., k:
fit model on k-1 folds (excluding the i-th fold)
test model on the i-th fold
The method above presents a subtle flaw. Consider a LOO approach. At iteration $i$ we consider the training set $X^{(-i)}$, that is, the original data without the sample $x_i$. Then we fit our PCA on $X^{(-i)}$ in order to retrieve the matrix $V^{(-i)}$, and finally we can compute the approximation error as:
$$ e_{i} = \|x_i - \hat x_i \|^2 = \|x_i - x_i^TV^{(-i)}V^{(-i)T}\|^2 $$and from all the approximation errors compute the PRESS as:
$$ PRESS_{bad} = \sum_i e_{i} = \sum_i \|x_i - \hat x_i \|^2 = \sum_i\|x_i - x_i^TV^{(-i)}V^{(-i)T}\|^2 $$The problem is that this approach uses the test set (the original $x_i$ sample) in order to compute the approximation of $x_i$ itself. It is as if, in a classification problem, we used the correct test label in order to predict that same label. This approach is very prone to overfitting the data: the PRESS score typically keeps decreasing for an increasing number of components, usually overestimating the optimal number of components and so lowering the generalization ability of our model.
There are different approaches to tackle this problem. Typically they are based on the idea of considering a single entry $x_{i,j}$ of the test sample $x_{i}$ at a time and reconstructing it:
$$ PRESS_{good} = \sum_{i=1}^m\sum_{j=1}^n | x_{i,j} - [V^{(-i)}(V^{(-i)}_{-j})^+x_{i,-j}]_j|^2$$This approach is very computationally expensive. A variation of it consists in masking a percentage of our matrix. Consider a masking matrix $M$:
$$ M_{i,j} = \begin{cases}0 & \text{if } validation\; point\\1 & \text{if } training\; point \end{cases}$$We can apply it to $X$ in order to leave out a percentage of the elements of the matrix.
This is a well known optimization problem usually called matrix completion:
$$\min_{F,V}\|M\circ(FV^T - X)\|_F^2$$where $\circ$ denotes the Hadamard product (element-wise product). The error of our model on the masked points is then:
$$\min_{F,V}\|(1-M)\circ(FV^T - X)\|_F^2$$A possible solution to this problem is an iterative method based on alternating least squares minimization [3]
### Code revised from https://github.com/idc9/ya_pca
import numpy as np

def lstsq(A, B, M):
    if A.ndim == 1:
        A = A[:, None]
    # solve the masked least-squares problem via a tensor representation
    rhs = np.dot(A.T, M * B).T[:, :, None]                           # n x r x 1 tensor
    T = np.matmul(A.T[None, :, :], M.T[:, :, None] * A[None, :, :])  # n x r x r tensor
    try:
        # transpose to get r x n
        return np.squeeze(np.linalg.solve(T, rhs), axis=-1).T
    except np.linalg.LinAlgError:
        # regularize singular systems with a small ridge term on the diagonal
        r = T.shape[1]
        T[:, np.arange(r), np.arange(r)] += 1e-6
        return np.squeeze(np.linalg.solve(T, rhs), axis=-1).T
import itertools

def cv_pca(X, n_comp, p_holdout=0.2, n_iter=100, tol=1e-8, replicates=3):
    # added tolerance to break iteration
    # added replicates with computation of mean error
    m, n = X.shape
    ranks = np.arange(1, n_comp)
    errors = {'train': {i: [] for i in ranks},
              'test': {i: [] for i in ranks}}
    M = np.random.rand(*X.shape) > p_holdout
    # fit models
    for rank, _ in itertools.product(ranks, range(replicates)):
        U = np.random.randn(m, rank)
        # iterative alternating least squares
        for itr in range(n_iter):
            Vt = lstsq(U, X, M)
            U_new = lstsq(Vt.T, X.T, M.T).T
            if np.linalg.norm(U_new - U) < tol:
                U = U_new
                break
            U = U_new
        resid = np.dot(U, Vt) - X
        train_err = np.mean(resid[M]**2)
        test_err = np.mean(resid[~M]**2)
        errors['train'][rank].append(train_err)
        errors['test'][rank].append(test_err)
    errors['train'] = {k: (np.mean(v), np.std(v)) for k, v in errors['train'].items()}
    errors['test'] = {k: (np.mean(v), np.std(v)) for k, v in errors['test'].items()}
    return errors
PCA is a very powerful method, but obviously it is not flawless. PCA is a linear method, so it performs well only if our data is linearly structured. What does this mean practically? Consider a 2D binary classification problem $\mathcal Y=\{0,1\}$ where each sample of our dataset is labeled with one of the two classes, and the two classes are structured spatially in a non-linear way. For example, in the plot below we can see two clusters of points organized in a XOR pattern around the origin
It's clear from the plot that PCA will never be able to find a direction onto which to project these points such that they are separated. The problem here is that the barycenters of the two clusters coincide. Computing PCA on this dataset as it is and projecting the points, we destroy the structure of the data, and the points of the two classes are no longer distinguishable.
As we have seen in the first part, typically points in high dimensional space are very sparse. Intuitively, we can expect them to also be more easily linearly separable. The idea here is exactly this: we want to move our data into a higher dimensional space. We can do this by crafting some new non-linear features. Imagine that we have some insight on the structure of our data. For example, in the case above we can consider a mapping function $\psi: \mathbb R^2 \rightarrow \mathbb R^3$ that maps $(u, v) \mapsto (u, v, uv)$. The range space of $\psi$ is called feature space. We can then map each of our samples to the feature space, $x_i\mapsto\psi(x_i)$, and apply PCA in the feature space
def mapping_function(point):
    x, y = point
    return (x, y, x*y)

X_features = np.array(list(map(mapping_function, X)))
x = X_features[:, 0]
y = X_features[:, 1]
z = X_features[:, 2]
As we can see from the plot above, in the feature space the points are now linearly separable, thus PCA can perform well.
The basic idea can be summed up in three main points:
Practically the success of this method boils down to the choice of a good mapping function $\psi$.
Computing the mapping into the feature space is computationally really expensive. Moreover, since the matrix $X_\psi$ lives in a higher dimensional space, the computation of the eigendecomposition of its covariance matrix is also a computational burden. The solution to all these problems is kernel functions. Given a mapping function $\psi:\mathcal X\rightarrow \mathcal F$, where $\mathcal F$ is a Hilbert space, we define a kernel function as:
$$ \mathcal K(\mathbf x_i, \mathbf x_j) = \langle \psi(\mathbf x_i),\psi(\mathbf x_j)\rangle=\psi(\mathbf x_j)^T\psi(\mathbf x_i), \qquad \mathbf x_i, \mathbf x_j \in \mathcal X$$The kernel function is a similarity measure between two points, corresponding to an inner product in the space $\mathcal F$
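To see this duality concretely, take the homogeneous degree-2 polynomial kernel on $\mathbb R^2$: its value coincides with an inner product under the explicit map $\psi(u) = (u_1^2, \sqrt2\,u_1u_2, u_2^2)$. This is a standard textbook example, not one of the kernels used later in this post.

```python
import numpy as np

def psi(u):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2."""
    return np.array([u[0]**2, np.sqrt(2) * u[0] * u[1], u[1]**2])

def k_poly2(u, v):
    """Kernel form: the same value, without ever visiting the feature space."""
    return float(np.dot(u, v)) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# the kernel value equals the inner product of the mapped points
assert np.isclose(k_poly2(u, v), np.dot(psi(u), psi(v)))
```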
The key point for a function to be a proper kernel is that it must represent a proper inner product and that $\mathcal F$ is a proper inner product space.
$$\begin{aligned}&{\textbf{(Mercer) Theorem:}} \text{ For any symmetric function } k: X \times X\rightarrow \mathbb R \text{ which is square integrable in } \\ &X \times X\text{ and which satisfies:}\\\;\\&\qquad\qquad\qquad \int_{ X \times X}k(x, x')f(x)f(x')\,dx\,dx' \ge 0, \quad \forall f\in L_2( X)\\\;\\ &\text{there exist }\phi_i:\mathcal X \rightarrow\mathbb R \text{ and numbers }\lambda_i \ge 0 \text{ s.t.}:\\\;\\&\qquad\qquad\qquad k(x, x') = \sum_i\lambda_i\phi_i(x)\phi_i(x'), \quad \forall x, x'\in X \end{aligned}$$Note that the double integral is a generalization to the continuous case of the vector-matrix-vector multiplication. Considering the discrete case, Mercer's condition requires that:
$${\displaystyle (g,Kg)=g^{T}{\cdot }Kg=\sum _{i=1}^{N}\sum _{j=1}^{N}\,g_{i}\,K_{ij}\,g_{j}\geq 0}$$where $K$ is called Kernel matrix $K$ or Gram matrix:
$$ K = G = \begin{bmatrix} \mathcal K(\mathbf x_1, \mathbf x_1) & \mathcal K(\mathbf x_1, \mathbf x_2) & \ldots & \mathcal K(\mathbf x_1, \mathbf x_m)\\ \mathcal K(\mathbf x_2, \mathbf x_1) & \mathcal K(\mathbf x_2, \mathbf x_2) & \ldots & \mathcal K(\mathbf x_2, \mathbf x_m)\\ \vdots & \vdots&\ddots & \vdots\\ \mathcal K(\mathbf x_m, \mathbf x_1) & \mathcal K(\mathbf x_m, \mathbf x_2) & \ldots & \mathcal K(\mathbf x_m, \mathbf x_m)\\ \end{bmatrix}$$Practically, a kernel function is a generalization of a symmetric positive semidefinite matrix. Restricting to the set $X$, we have from Mercer's condition that $\mathcal K$ is a proper kernel function if $K$ is a positive semidefinite matrix.
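As a numerical sanity check of the discrete Mercer condition, we can build an RBF Gram matrix on some random points and verify that it is symmetric with non-negative eigenvalues. The choice $\gamma = 0.5$ and the point cloud are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # 20 points in R^3

# RBF Gram matrix: K_ij = exp(-gamma * ||x_i - x_j||^2), with gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

# Mercer's condition in the discrete case: K is symmetric positive semidefinite
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10
```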
Some of the most famous kernels are:
A Hilbert space $\mathcal H$ is a vector space equipped with an inner product $\langle \cdot, \cdot \rangle_\mathcal H$ (and complete with respect to the induced norm) with the following properties:
Now consider an arbitrary set $X$ and the Hilbert space $\mathcal H$ of real-valued functions on $X$. For an element $x\in X$ we can define the Dirac evaluation functional
$$ \\\;\\ \begin{aligned} \delta_x \quad :\quad &\mathcal H \rightarrow \mathbb R\qquad s.t.\; \delta_x(f) = f(x)\\ & f \mapsto f(x) \end{aligned} \\\;\\ $$that is, practically, a linear functional that evaluates each function in $\mathcal H$ at the point $x$. We say that $\delta_x$ is bounded if there is a constant $M>0$ such that:
$$ \\\;\\ \|\delta_xf\| = |f(x)|\le M\|f\|_\mathcal H\quad \forall f\in\mathcal H \;\text{ and }\forall x \in X \\\;\\ $$From this we have the following definition:
$$\\\;\\ \begin{aligned}&{\textbf{Definition:}} \\& \text{A }\textbf{Reproducing Kernel Hilbert Space (RKHS)} \text{ is a Hilbert space }\mathcal H \text{ where all the Dirac}\text{ evaluation functionals in } \mathcal H\\& \text{ are bounded and continuous.}\end{aligned} \\\;\\ $$$$ \\\;\\ \begin{aligned}&{\textbf{(Riesz representation) Theorem:}} \\&\text{Given a bounded linear functional } \varphi \text{ on a Hilbert Space } \mathcal H\text{, there exists a unique } f_\varphi \in \mathcal H \text{ such that:}\qquad\qquad\quad \\\;\\&\qquad\qquad\qquad \varphi(f) = \langle f, f_\varphi\rangle_\mathcal H\qquad \forall f\in\mathcal H \end{aligned} \\\;\\ $$Considering the Dirac evaluation functionals, we get that for each $\delta_x$ there exists a unique $k_x\in\mathcal H$ such that:
$$ \delta_xf = f(x) = \langle f, k_x\rangle_\mathcal H $$$$ \\\;\\ \begin{aligned}&{\textbf{Definition:}} \\&\text{Let }\mathcal H\text{ be a Hilbert space of functions from }X\text{ to }\mathbb R\text{. A function }\mathcal K: X\times X \rightarrow \mathbb R\text{ is called a }\textbf{Reproducing Kernel}\\&\text{of }\mathcal H\text{ if it satisfies: } \\\;\\&\qquad\qquad \text{- }\forall x\in X,\;k_x = \mathcal K(\cdot, x) \in \mathcal H \qquad\text{( }k_x\text{ is called representer at } x\text{)} \\&\qquad\qquad \text{- }\forall x\in X,\forall f\in \mathcal H,\;\;\langle f, \mathcal K(\cdot, x)\rangle = f(x) \qquad\text{(Reproducing property)} \end{aligned} \\\;\\ $$Note that since $k_x\in\mathcal H$, it is itself a function mapping $X\rightarrow \mathbb R$. So we have $k_x(y) = \mathcal K(x,y)$, and in particular for any $x, x' \in X$:
$$\mathcal K(x, x') = \langle\mathcal K(\cdot, x), \mathcal K(\cdot, x')\rangle_\mathcal H = \langle k_x, k_{x'}\rangle_\mathcal H $$where $k_x$ and $k_{x'}$ are the unique representers of $\delta_x$ and $\delta_{x'}$. Note also that since $\langle \cdot, \cdot\rangle_\mathcal H$ is an inner product (symmetric and positive definite), so is $\mathcal K$. The following properties hold:
Reproducing kernels and the kernel functions defined above are closely related: every reproducing kernel is also a kernel. This can be simply proved by considering the mapping function:
$$ \\\;\\ \begin{aligned} \psi \quad :\quad &X\rightarrow \mathcal H \\ & x\mapsto \mathcal K(\cdot, x) = k_x \end{aligned} \\\;\\ $$and so the RKHS is the feature space $\mathcal F$ defined above.
The link that connects all these concepts is the Moore-Aronszajn theorem:
$$\\\;\\ \begin{aligned}&{\textbf{Moore-Aronszajn Theorem:}} \\&\text{Let }\mathcal K:X\times X\rightarrow \mathbb R \text{ be a positive definite kernel. Then there is a unique RKHS }\mathcal H_\mathcal K\subset \mathbb R^X \text{ with reproducing kernel } \mathcal K\end{aligned} \\\;\\ $$The representer theorem states:
$$\begin{aligned}&{\textbf{Representer Theorem:}} \text{ Consider any learning rule of the form } \\\;\\ &\qquad\qquad\qquad \mathbf w^* = \arg\min_\mathbf w \Big(f\big(\langle\mathbf w, \psi(\mathbf x_1)\rangle, \ldots , \langle\mathbf w, \psi(\mathbf x_m)\rangle\big) + \mathcal R(\mathbf w)\Big) \\\;\\ &\text{where }f:\mathbb R^m\rightarrow \mathbb R \text{ is an arbitrary function and } \mathcal R \text{ is a regularization term. Then:} \\\;\\&\qquad\qquad\qquad \exists\;\mathbf \alpha \in\mathbb R^m \;s.t.\;\mathbf w^* = \sum_{i=1}^m\alpha_i\psi(\mathbf x_i)\end{aligned}$$The basic idea to take advantage of kernels is to express PCA in the feature space in terms of dot products, and then to exploit the duality between the kernel function and the dot product in order to formulate a version of PCA that avoids the direct computation of the covariance matrix in the $\mathcal F$ space. The key idea is that we can express each principal component in the feature space $v_k^{(\psi)}$ as a linear combination of the points in the feature space.
The trick is that KernelPCA does not compute directly the principal components of the covariance matrix $X_\psi^TX_\psi$ in the feature space, but it computes the projection of our data onto those components.
$$ v_k^{(\psi)T}\psi(x_i) = \big(\sum_{j=1}^m \alpha_{j,k}\psi(x_j) \big )^T\psi(x_i) = \sum_{j=1}^m \alpha_{j,k}\big\langle\psi(x_i), \psi(x_j)\big\rangle =\sum_{j=1}^m \alpha_{j,k}\mathcal K(x_i, x_j) $$Considering that $\mathcal K(\mathbf x_i, \mathbf x_j)=\psi(\mathbf x_j)^T\psi(\mathbf x_i)$, we have that the projection is simply a linear combination of column $i$ of the kernel matrix. The problem boils down to computing the factors $\alpha_{j,k}$, and we can state it as:
$$ m\lambda\mathbf \alpha = K\mathbf \alpha$$That is, an eigendecomposition problem of the kernel matrix $K$, where $m$ is the number of points.
Also in this case the algorithm is simple, and can be decomposed in three main steps:
The first step consists simply in the computation of the kernel matrix $K$, where each component $K_{i,j} = \mathcal K(x_i, x_j)$ for each pair of points in the dataset.
The second step is a bit trickier. PCA requires that the matrix on which it is applied is zero-centered. Even if the matrix $X$ is centered column-wise, we cannot guarantee that the kernel matrix is centered as well. In this case we perform the following preprocessing on the matrix $K$:
$$K \quad\leftarrow\quad K-{\mathbf {1_{m}}}K-K{\mathbf {1_{m}}}+{\mathbf {1_{m}}}K{\mathbf {1_{m}}}$$where the matrix $\mathbf {1_{m}}$ is an $m\times m$ matrix of the form:
$$\mathbf {1_{m}}= \frac{1}{m}\begin{bmatrix} 1&1&\ldots&1 \\ 1&1&\ldots&1 \\ \vdots&\vdots&\ddots&\vdots\\ 1&1&\ldots&1 \\ \end{bmatrix}\in\mathbb R^{m,m} $$Finally, also in this case we can solve the eigendecomposition problem of the kernel matrix through an SVD decomposition, considering the matrix $K_{half} = K^{1/2}$ such that $K = K_{half}K_{half}^T$. In this case we can retrieve the eigendecomposition of $K$ through the SVD decomposition of $K_{half}$
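A simple way to convince ourselves that the double-centering formula really centers the implicitly mapped points is the linear kernel: there the feature map is the identity, so we can center the data explicitly and compare. The data below is random, just for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 4))
m = X.shape[0]

K = X @ X.T                       # linear kernel: psi is the identity map
one_m = np.ones((m, m)) / m       # the 1_m matrix of the formula above
K_centered = K - one_m @ K - K @ one_m + one_m @ K @ one_m

# for the linear kernel, centering K must match building the Gram matrix
# from the explicitly column-centered data
Xc = X - X.mean(axis=0)
assert np.allclose(K_centered, Xc @ Xc.T)
```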
from scipy.spatial.distance import pdist, squareform

def poly_kernel(x1, x2, k=25):
    return np.power(1 + np.dot(x1, x2), k)

def rbf_kernel(X, gamma):
    sq_dists = pdist(X, 'sqeuclidean')
    mat_sq_dists = squareform(sq_dists)
    K = np.exp(-gamma * mat_sq_dists)
    return K

def kernel_matrix(X, kernel_fun):
    m = X.shape[0]
    G = np.ones((m, m))
    for i in range(m):
        for j in range(i, m):
            G[i, j] = kernel_fun(X[i], X[j])
            G[j, i] = G[i, j]
    return G
import scipy.linalg as la

class myKernelPCA():
    def __init__(self, n_components=None, use_std=False, solver='svd', kernel=rbf_kernel, **kwargs):
        '''
        n_components: int or float[0,1]
            - if int, it is interpreted as the number of components to keep
            - if float, it is interpreted as the percentage of variance to keep
        use_std: bool, default=False
            - if False X will be only centered column-wise (covariance PCA)
            - if True X will be standardized considering also the std of each variable (correlation PCA)
        '''
        self.use_std = use_std
        self.n_components = n_components
        self.var_explained = None
        self.solver = solver
        self.kernel = kernel
        self.kwargs = kwargs

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

    def fit(self, X):
        '''
        Performs computation of the principal components on the dataset X
        X: np.array dim (m,n)
            - m is the number of samples (observations) in the dataset
            - n is the number of variables (features)
        '''
        self.n_samples, self.n_features = X.shape
        K = self.kernel(X, **self.kwargs)
        # centering of the kernel matrix
        one_matrix = np.ones((self.n_samples, self.n_samples)) / self.n_samples
        K = K - one_matrix.dot(K) - K.dot(one_matrix) + one_matrix.dot(K).dot(one_matrix)
        if self.solver == 'svd':
            G = la.fractional_matrix_power(K, 1/2)
            G = np.real_if_close(G, tol=1)
            U, S, V = la.svd(G)  # SVD decomposition of K^(1/2)
            V = V.T  # eigenvectors of K
            self.lambdas = (S ** 2) / (self.n_samples - 1)  # eigenvalues of K
        elif self.solver == 'eig':
            # eigendecomposition
            self.lambdas, V = la.eigh(K)
            self.lambdas = np.real_if_close(self.lambdas, tol=1)
            V = np.real_if_close(V, tol=1)
        # ordering eigenvalues and eigenvectors
        i_lambdas = np.argsort(self.lambdas)[::-1]
        self.lambdas = self.lambdas[i_lambdas]
        V = V[:, i_lambdas]
        self.components_ = V
        self.__find_components()

    def __find_components(self):
        '''
        private method used to compute the number of components
        '''
        # compute variance explained for each component and cumulative variance explained
        tot_variance = np.sum(self.lambdas)
        self.relative_variance_explained = [x / tot_variance for x in self.lambdas]
        self.cumulative_variance_explained = np.cumsum(self.relative_variance_explained)
        if self.n_components is None:
            self.n_components = len(self.lambdas)
        elif type(self.n_components) is int:
            self.n_components = self.n_components
        elif 0.0 <= self.n_components <= 1.0:
            # if self.n_components is a float in [0,1], compute the first index of
            # cumulative variance greater or equal to it
            self.n_components = np.argmax(self.cumulative_variance_explained >= self.n_components) + 1
        else:
            # in all the other cases set by default to the max possible
            self.n_components = self.n_features
        self.var_explained = self.cumulative_variance_explained[self.n_components - 1]

    def transform(self, X, n_components=None):
        '''
        Projection of X into the principal component space
        '''
        if n_components is None or n_components > self.n_components:
            n_components = min(X.shape)
        return self.components_[:, :n_components]

    def inv_transform(self, W, n_components=None):
        '''
        Reconstruction of the matrix from the principal component space
        '''
        if n_components is None or n_components >= self.n_components:
            n_components = min(W.shape[1], self.n_components)
        return W @ self.components_[:, :n_components].T
To test the method, we can use the function make_moons from the scikit-learn library. From the plot we can see how, also in this case, even if the barycenters are not overlapped, simple PCA does not perform really well
However, we can see how the rbf kernel-PCA is able to deconstruct the original structure of the points: the points projected onto the kernel principal components are clearly linearly separable
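For reference, the experiment can be reproduced with scikit-learn's own PCA and KernelPCA (the myKernelPCA class above should behave analogously); the value gamma=15 is just a hand-picked choice for this toy dataset, not a tuned parameter.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# linear PCA on 2D data is just a rotation: the moons stay entangled
Z_lin = PCA(n_components=2).fit_transform(X)

# rbf kernel PCA: in the kernel feature space the two moons unfold
Z_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=15).fit_transform(X)

# a linear classifier on the kernel projection separates the classes well
acc_rbf = LogisticRegression().fit(Z_rbf, y).score(Z_rbf, y)
```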
ICA aims to find a base of independent components.
Independent component analysis can be considered as a sort of evolution of PCA. ICA, as the name suggests, tries to find independent components by taking advantage of statistics of higher order than the covariance, such as, for example, the kurtosis.
We assume that there exists a set of original signals $S$, such that each source $s_i\in S$ is independent from all the others $s_{j\neq i}\in S$. The dataset $X$ we have is considered to be a set of observations, constructed as a linear mixture of these source signals $s_i$:
$$ X = AS$$The matrix $A$ is called mixing matrix, and is an unknown invertible matrix of mixing components.
The main goal of ICA is to recover an approximation of the unmixing matrix $W\approx A^{-1}$, in order to retrieve the set of source signals $S$ starting from the set of observations $X$.
$$ S\approx \hat S= WX $$Note that we have two unknowns in this problem: the mixing matrix $A$ and the set of sources $S$. The problem seems impossible to solve, since our only observation is the matrix product of the two unknowns.
This problem can be solved in a divide-and-conquer fashion. First we focus on finding $A$. Also in this case the SVD comes to our aid. Given the decomposition $A=U\Sigma V^T$ of the original mixing matrix, the unmixing matrix can be written as:
$$ W = A^{-1} = V\Sigma^{-1}U^T $$We can now analyze the covariance matrix of observations $X$, and we have:
$$\begin{aligned}XX^T & = (AS)(AS)^T\\ & = (U\Sigma V^TS)(U\Sigma V^TS)^T \\ & = U\Sigma V^T \underbrace{S S^T}_{I} V\Sigma U^T \\ & = U\Sigma^2 U^T \end{aligned}$$Considering the eigenvalue decomposition $XX^T = EDE^T$ and comparing with the equation above, we have that $U = E$ and $\Sigma = D^{1/2}$, as seen before for the PCA. We can rewrite the unmixing matrix considering the results obtained:
$$ W = A^{-1} = V\Sigma^{-1}U^T = VD^{-1/2}E^T $$The application of the matrices $D^{-1/2}E^T$ is a classical operation in signal processing. This operation is called whitening:
$$ X_w = (D^{-1/2}E^T)X $$Whitening has the purpose of removing all linear dependencies in a data set (second order correlations) and normalizing the variance along each dimension. The resulting matrix $X_w$ is rotationally symmetric, like a sphere. For this reason this procedure is sometimes also called sphering.
$$X_wX_w^T = I_n = \begin{pmatrix} 1&0&\ldots&0\\ 0&1&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&1 \end{pmatrix}\in \mathbb R^{n,n}$$Now the ICA problem boils down to finding a single rotation matrix $V$, such that:
$$ \hat {S} = VX_w $$The whitening step has removed all the linear dependencies between all pairs of variables based on second-order correlation; to recover $V$ we need to analyse other measures of dependency.
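As a sanity check, the whitening step can be verified numerically: whatever invertible mixing we apply, $D^{-1/2}E^T$ brings the covariance back to the identity. The signals and the mixing matrix below are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
# three synthetic signals with very different variances, linearly mixed
S = rng.standard_normal((3, 1000)) * np.array([[1.0], [5.0], [0.2]])
A = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.2, 0.0, 1.0]])
X = A @ S

# eigendecomposition of the sample covariance XX^T / n = E D E^T
D, E = np.linalg.eigh(X @ X.T / X.shape[1])

# whitening: X_w = D^{-1/2} E^T X
X_w = np.diag(D ** -0.5) @ E.T @ X

# the whitened covariance is the identity (up to floating point error)
assert np.allclose(X_w @ X_w.T / X.shape[1], np.eye(3), atol=1e-8)
```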
Statistical independence is the strongest measure of dependence between random variables. It requires that correlations of all orders (not only second) vanish. Only in this case does the joint probability distribution factorize into the product of the marginal distributions. In the case of our original signals we require that:
$$ P(S) = \prod_{i}P(s_i) $$Since with the whitening step we have removed the second-order correlations, what we want to do now is to find a rotation matrix $V$ that removes all the higher-order correlations. The solution to this problem comes from information theory. It can be proved that the matrix we seek is given by:
$$ V = \arg\min_{V} \sum_i H\big[(VX_w)_i\big] $$where $\hat {s_i} = (VX_w)_i $ is the i-th component of the recovered signals $\hat S$, and $H[y] = -\int P(y)\log_2 P(y) dy $ is the entropy, a measure of the amount of uncertainty about the distribution $P(y)$.
Non-Gaussianity serves as a proxy for statistical independence. In particular, a solution of the equation above finds the rotation that maximizes the non-Gaussianity of the transformed data. The deviation of a probability distribution from a Gaussian is commonly measured by the negentropy. Maximizing the negentropy (non-Gaussianity) is equivalent to minimizing the sum of the marginal entropies.
FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology.
This algorithm is composed of two main steps:
a prewhitening step
a component extraction step
In the second part we want to find the directions $w_i$ that maximize a measure of non-Gaussianity of the vector $w_i^TX$. To measure non-Gaussianity, FastICA relies on a nonquadratic nonlinear function $f(u)$, its first derivative $g(u)$, and its second derivative $g'(u)$. Typical choices of these functions are:
$$\begin{aligned} &f(u)=\log \cosh(u),&\quad g(u)=\tanh(u),&\quad {g}'(u)=1-\tanh ^{2}(u),\\ &f(u) = -e^{-u^2/2},&\quad g(u) = u e^{-u^2/2},&\quad {g}'(u) = (1-u^2) e^{-u^2/2} \end{aligned}$$
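A quick finite-difference check confirms that the $g$ and $g'$ listed above for the first choice really are the first and second derivatives of $f(u)=\log\cosh(u)$:

```python
import numpy as np

def f(u):
    return np.log(np.cosh(u))

def g(u):          # g = f'
    return np.tanh(u)

def g_prime(u):    # g' = f''
    return 1 - np.tanh(u) ** 2

u = np.linspace(-3, 3, 101)
h = 1e-6
# central finite differences of f and g match g and g' respectively
assert np.allclose((f(u + h) - f(u - h)) / (2 * h), g(u), atol=1e-6)
assert np.allclose((g(u + h) - g(u - h)) / (2 * h), g_prime(u), atol=1e-6)
```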
The computation of the directions is done iteratively. For each direction $i$ we start from a random vector $w_i^{(0)}$. At each step we compute the next direction $w_i^{(k+1)}$ starting from the current one $w_i^{(k)}$ and the non-linear function $f(u)$:
$$w_i^{(k+1)} = E[Xg({w_i^{(k)}}^TX)] - E[g'({w_i^{(k)}}^TX)]w_i^{(k)}\\ w_i^{(k+1)} = \frac{w_i^{(k+1)}}{\|w_i^{(k+1)}\|} $$Then we re-orthogonalize with respect to the directions already computed. The stopping criterion of the iteration can be given as a tolerance (convergence of $\|w_i^{(k+1)} - w_i^{(k)}\|$) or as a maximum number of iterations.
class myFast_ICA():
    def __init__(self, n_components=None, g_funct='tanh', tol=1e-8, num_iteration=1000, rng=None, seed=None):
        self.g_funct = g_funct
        self.n_components = n_components
        self.tol = tol
        self.num_iteration = num_iteration
        if rng is not None:
            self.rng = rng
        elif seed is not None:
            self.rng = np.random.default_rng(seed)
        else:
            self.rng = np.random.default_rng()

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)

    def fit(self, X):
        tol = self.tol
        num_iteration = self.num_iteration
        X_ = np.array(X.T)
        n_samples, n_features = X_.shape
        if self.n_components is None or self.n_components >= min(n_samples, n_features):
            self.n_components = min(n_samples, n_features)
        # center the matrix
        mu = X_.mean(1, keepdims=True)
        X_centered = X_ - mu
        self.mean = mu
        # prewhitening with D^-1/2 U^T
        U, S, V = la.svd(X_centered, full_matrices=False)
        Z = (U / S).T[:self.n_components]  # given XX^T = EDE^T we have D = S^2 and E = U
        self.whitening_matrix = Z
        X_whitened = np.dot(Z, X_centered) * np.sqrt(n_features)
        # find components
        W = np.zeros((self.n_components, self.n_components))
        for p in range(self.n_components):
            # initial random guess
            wp = self.rng.random(size=self.n_components)
            wp = wp / la.norm(wp)
            for i in range(num_iteration):
                wp_old = wp
                # update vector
                u = wp_old.T @ X_whitened
                g, g_der = self.__G(u)
                wp = (X_whitened * g).mean(1) - g_der.mean() * wp_old
                # re-orthogonalization through modified Gram-Schmidt
                for j in range(p):
                    wp -= np.linalg.multi_dot([wp, W[:j+1].T, W[:j+1]])
                # normalization
                wp = wp / la.norm(wp)
                if la.norm(wp_old - wp) < tol:
                    break
            W[p, :] = wp
        self.V = W
        self.W = (W @ Z).T

    def transform(self, X):
        B = X - self.mean.T
        return np.dot(B, self.W)

    def __G(self, x):
        if self.g_funct == 'tanh':
            tanh = np.tanh(x)
            return tanh, 1 - tanh * tanh
        elif self.g_funct == 'exp':
            exp = np.exp(-(x**2) / 2)
            return x * exp, (1 - x**2) * exp
ICA is used for Blind Source Separation. A typical application in the sound field is the so-called cocktail party problem. Imagine we have different sound sources and different observations, given for example by some microphones. Each observation is a linear combination of the sources, where the coefficient associated with each source depends on different factors, like the distance from the microphone and the microphone settings. The goal is to retrieve the original signals from the observations
To test the goodness of my implementation, I have tried to apply it to a simple example of a cocktail party problem, following the example present in the scikit-learn library's documentation.
In this example, three source signals are generated: a sinusoidal signal, a square wave pulse and a sawtooth signal.
The three original signals are then mixed with a random mixing matrix $A$ to generate the observations $X$. Finally, both the sklearn implementation of FastICA and mine are used to recover the original source signals
import IPython.display as ipd
from scipy.io.wavfile import read
from plotly.subplots import make_subplots
import plotly.graph_objects as go

AUDIO_NAME = ['train', 'laugh', 'gong']
fig = make_subplots(rows=3, cols=1)
S = []
for i, audio in enumerate(['audio/original_train.wav', 'audio/original_laughter.wav', 'audio/original_gong.wav']):
    x = read(audio)
    S.append(x[1])
    fig.add_trace(go.Scatter(mode='lines', x=np.arange(len(x[1])), y=x[1], name=AUDIO_NAME[i]), col=1, row=i+1)
S = np.array(S)
fig.update_layout(title='Original audio sources')
fig.show()
for i, s in enumerate(S):
    print(f'ORIGINAL: {AUDIO_NAME[i]}')
    ipd.display(ipd.Audio(s, rate=x[0]))
To test the goodness of the algorithm we can import three sample audio signals, plotted above. Then with a random matrix we can simulate the mixing of these signals. Note how in a real application, typically, the only thing we have are the mixed signals in the plot below, and from these we want to retrieve the source signals on top. As we can see even from this simple example, this is not at all a trivial problem: looking at the three mixed signals, it is very difficult to spot the underlying original signals.
import numpy as np

A = np.random.rand(3, 3)
X = np.dot(A, S)

fig = make_subplots(rows=3, cols=1)
for i, mix in enumerate(X):
    fig.add_trace(go.Scatter(mode='lines', x=np.arange(len(mix)), y=mix, name=f'mix-{i}'), col=1, row=i+1)
fig.update_layout(title='Mixed signals (microphones)')
fig.show()
for i, s in enumerate(X):
    print(f'MICROPHONE: {i+1}')
    ipd.display(ipd.Audio(s, rate=x[0]))
myica = myFast_ICA(n_components=3, tol=1e-15, num_iteration=2000)
myica_S = myica.fit_transform(X.T)

fig = make_subplots(rows=3, cols=1)
for i in range(myica_S.shape[1]):
    fig.add_trace(go.Scatter(mode='lines', x=np.arange(myica_S.shape[0]), y=myica_S[:, i], name=f'reconstructed-{i}'), col=1, row=i+1)
fig.update_layout(title='Reconstructed signals (myICA)')
fig.show()
for i in range(myica_S.shape[1]):
    print(f'RECONSTRUCTED: {i+1}')
    ipd.display(ipd.Audio(myica_S[:, i], rate=x[0]))
Typical applications of ICA include the analysis of photographic images, medical signals (EEG, MEG, MRI), biological assays and obviously audio signal processing. For example, considering medical signals, a typical application regards artifact removal (https://sccn.ucsd.edu/~jung/Site/EEG_artifact_removal.html)
[1] fifa19 dataset https://www.kaggle.com/karangadiya/fifa19
[2] Principal Component Analysis, Herve Abdi and Lynne J. Williams https://personal.utdallas.edu/~herve/abdi-awPCA2010.pdf
[3] http://alexhwilliams.info/itsneuronalblog/2018/02/26/crossval/
[4] Reproducing Kernel Hilbert Spaces https://www.math.unipd.it/~demarchi/TAA1718/RKHS_presentazione.pdf
[5] FastICA pseudocode https://en.wikipedia.org/wiki/FastICA